Add tool-execution observability (0063) #178

Merged: chris-colinsky merged 3 commits into main from feature/0063-tool-execution-observability on Jun 22, 2026

@chris-colinsky

Implements accepted proposal 0063 (spec v0.69.0): makes a caller's tool execution observable. This is the execution-side complement to 0076 (the request side), joined by tool_call_id, and the last feature of the v0.15.0 cycle. The spec pin advances from v0.68.0 to v0.69.0 (0063 is the only proposal in the delta).

The primitive

A model requests tools in its completion; the caller executes them in node-body code (OpenArmature does not run, select, loop, or feed back tools), so that execution was invisible to the observer stream. with_tool_call is a node-body instrumentation scope, a context manager modeled on with_active_prompt:

from openarmature.observability import with_tool_call

with with_tool_call("get_weather", {"city": "Paris"}, tool_call_id="call_abc") as scope:
    result = await get_weather(city="Paris")
    scope.set_result(result)

You run the tool inside it and report the outcome with scope.set_result(...). On a clean exit it dispatches a ToolCallEvent; if the tool raises, it dispatches a ToolCallFailedEvent and re-raises (it observes, it does not swallow). I chose the context-manager shape over an inline-wrapping helper to stay consistent with the existing with_active_prompt node-body primitives.
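
To make the dispatch-and-re-raise contract concrete, here is a minimal sketch of how such a scope could be built. It is not the implementation in this PR: `sketch_tool_call`, `SketchScope`, and the module-level `events` list are hypothetical stand-ins, and plain dicts stand in for the typed events.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical stand-in for the observer stream.
events: list = []


@dataclass
class SketchScope:
    """Hypothetical scope object; the result defaults to None."""
    result: Any = None

    def set_result(self, value: Any) -> None:
        self.result = value


@contextmanager
def sketch_tool_call(tool_name: str, arguments: dict, tool_call_id: Optional[str] = None):
    scope = SketchScope()
    started = time.perf_counter()
    base = {"tool_name": tool_name, "tool_call_id": tool_call_id, "arguments": arguments}
    try:
        yield scope
    except Exception as exc:
        # Failure path: record the failure, then re-raise (observe, do not swallow).
        latency_ms = (time.perf_counter() - started) * 1000.0
        events.append({**base, "kind": "failed", "latency_ms": latency_ms,
                       "error_type": type(exc).__name__, "error_message": str(exc)})
        raise
    else:
        # Clean exit: record whatever the caller reported via set_result().
        latency_ms = (time.perf_counter() - started) * 1000.0
        events.append({**base, "kind": "ok", "latency_ms": latency_ms,
                       "result": scope.result})


# Clean exit dispatches one success event.
with sketch_tool_call("get_weather", {"city": "Paris"}, tool_call_id="call_abc") as scope:
    scope.set_result({"temp_c": 21})

# A raising tool dispatches one failure event and the exception still propagates.
try:
    with sketch_tool_call("get_weather", {"city": "Atlantis"}):
        raise LookupError("unknown city")
except LookupError:
    pass
```

In this sketch the event is emitted at outcome time only, which matches the inline bracketing form described in the PR.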

The events

Two typed variants on the graph-engine observer union, mirroring the LLM completion/failure pairing. Both carry the identity/scoping baseline plus tool_name, tool_call_id (the link back to the requesting LlmCompletionEvent.output_tool_calls entry, or null for a standalone instrumented function), arguments, latency_ms, and call_id. ToolCallEvent adds result; ToolCallFailedEvent adds error_type / error_message and deliberately carries no error_category (tool code is arbitrary, with no closed failure taxonomy).
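
The tool_call_id linkage can be illustrated with a small join between requested calls and executed calls. This is a sketch under assumptions: `SketchToolCallEvent` is a hypothetical reduced event, and the requested calls are modeled as dicts with an `id` key, which may not match the real shape of `output_tool_calls` entries.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class SketchToolCallEvent:
    """Hypothetical, reduced stand-in for the success event."""
    tool_name: str
    tool_call_id: Optional[str]  # None for a standalone instrumented function
    arguments: dict
    latency_ms: float
    result: Any = None


def join_requests_to_executions(requested_calls: list, executions: list) -> dict:
    """Map each requested call id to its execution event, or None if never executed."""
    by_id = {e.tool_call_id: e for e in executions if e.tool_call_id is not None}
    return {req["id"]: by_id.get(req["id"]) for req in requested_calls}


# Assumed shape for the request side: one dict per requested tool call.
requested = [
    {"id": "call_abc", "name": "get_weather"},
    {"id": "call_xyz", "name": "get_time"},
]
executed = [
    SketchToolCallEvent("get_weather", "call_abc", {"city": "Paris"}, 12.5, {"temp_c": 21}),
    SketchToolCallEvent("cleanup", None, {}, 0.3),  # standalone, no request to link to
]
joined = join_requests_to_executions(requested, executed)
```

Here `call_abc` resolves to its execution, `call_xyz` resolves to None (requested but never run), and the standalone execution is left out of the join.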

Rendering

  • OTel: an openarmature.tool.call span (note .call, not .complete) parented under the calling node, with OA-namespace openarmature.tool.{name,call.id,call.arguments,call.result} attributes and the standard error.type on failure. The Development gen_ai.tool.* / execute_tool surface is mirrored, not emitted in v1, so a future cutover is a prefix swap.
  • Langfuse: a dedicated Tool observation (asType="tool", not a Generation) under the node's Span observation, with arguments / result as input / output and the tool name / call id in metadata, ERROR level on failure. This is the Python implementation's first observation type other than Span and Generation, so tool() was added to the client Protocol, the in-memory recorder, and the SDK adapter.

arguments and result are payload, gated by disable_provider_payload (no new flag); disable_llm_spans does not gate the tool span.
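
The gating rule can be sketched as a function that builds the span attributes. The attribute names come from the description above; the function name `tool_span_attributes`, its signature, and the choice to omit the call id when it is null are assumptions for illustration, not the observer's actual code.

```python
import json
from typing import Any, Optional


def tool_span_attributes(
    tool_name: str,
    tool_call_id: Optional[str],
    arguments: dict,
    result: Any,
    *,
    disable_provider_payload: bool = False,
) -> dict:
    """Hypothetical helper: identity attributes always, payload only when not gated."""
    attrs = {"openarmature.tool.name": tool_name}
    if tool_call_id is not None:
        # Assumption: a null call id is omitted rather than written as an empty value.
        attrs["openarmature.tool.call.id"] = tool_call_id
    if not disable_provider_payload:
        # arguments and result are payload, so they sit behind the existing flag.
        attrs["openarmature.tool.call.arguments"] = json.dumps(arguments, default=str)
        attrs["openarmature.tool.call.result"] = json.dumps(result, default=str)
    return attrs


open_attrs = tool_span_attributes("get_weather", "call_abc", {"city": "Paris"}, {"temp_c": 21})
gated_attrs = tool_span_attributes(
    "get_weather", "call_abc", {"city": "Paris"}, {"temp_c": 21},
    disable_provider_payload=True,
)
```

With the flag set, the span in this sketch still carries the tool name and call id, so the request-to-execution link survives payload redaction.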

Two fixes from the self-review

  • Payload serialization in both observers now uses default=str, so an opaque tool result that JSON cannot natively encode (a Pydantic model, a datetime) renders via its str() instead of raising inside the observer and losing the whole span/observation.
  • The Langfuse SDK adapter now back-dates the Tool observation (generalized the back-dating helper to wrap LangfuseTool), so the live observation's duration reflects the tool latency, matching the Generation path.
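
The serialization fix relies on standard-library behavior that is easy to check in isolation. The payload below is made up for illustration; only `json.dumps` and its `default` parameter are real API.

```python
import json
from datetime import datetime, timezone

# Illustrative tool result containing a value JSON cannot encode natively.
result = {
    "city": "Paris",
    "observed_at": datetime(2026, 6, 22, 12, 0, tzinfo=timezone.utc),
}

# Without a fallback, json.dumps raises TypeError on the datetime.
try:
    json.dumps(result)
    strict_failed = False
except TypeError:
    strict_failed = True

# With default=str, any unencodable value is rendered via str() instead.
rendered = json.dumps(result, default=str)
```

The trade-off is that the rendered payload is lossy for such values (a string rather than a structured object), which is usually acceptable for an observability attribute.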

Scope

v1 ships the inline bracketing form only. The deferred start/complete split (a tool result landing in a later turn) is a spec MAY with no fixture, deferred until a consumer needs event-driven tool execution; inline-only is fully conformant.

Tests

  • Unit: the primitive (dispatch, re-raise, tool_call_id linkage, the serialization fallback), the OTel span, the Langfuse Tool observation, and the adapter back-dating (verified against the mock SDK).
  • Conformance: fixtures 092-098 run through a tool-graph runner (092-095 typed-event-collector, 096/097 OTel span_tree, 098 Langfuse Tool observation). The fixture-parser schema gained the calls_tool directive and the record state type; 092-095 defer-parse like the 050-056 typed-collector fixtures.
  • Full suite green; ruff + pyright clean; mkdocs build --strict clean.

The model requests tools in its completion (0076); the caller runs
them in node-body code, which was invisible to observers. Add the
with_tool_call instrumentation scope -- a context manager (like
with_active_prompt) the caller wraps a tool execution in -- plus the
typed ToolCallEvent / ToolCallFailedEvent it dispatches at outcome
time (re-raising on failure; the failure event carries error_type /
error_message and deliberately no error_category).

The OTel observer renders an openarmature.tool.call span (OA-namespace
attributes, error.type on failure); the Langfuse observer renders a
dedicated Tool observation (asType "tool"), which adds a tool() method
to the client Protocol, the in-memory recorder, and the SDK adapter.
Arguments and result are payload, gated by disable_provider_payload.

Also harden both observers' payload serialization with default=str so
an opaque tool result JSON can't encode renders via str() instead of
crashing the observer, and back-date the Langfuse Tool observation
(generalize the back-dating helper to wrap LangfuseTool).

Implements proposal 0063 (graph-engine 6, observability 5.5 / 8.4).
Advance the spec pin v0.68.0 -> v0.69.0 across the four sync points
(submodule, __spec_version__, pyproject, conformance manifest) and
the smoke assertion; regenerate the bundled AGENTS.md.

Wire conformance fixtures 092-098 through a tool-graph runner
(calls_tool / calls_llm / update nodes) dispatching across the
typed-event-collector, OTel span-tree, and Langfuse Tool-observation
assertion shapes; teach the fixture-parser schema the calls_tool
directive and the record state type, and defer-parse 092-095 (the
typed-collector shape, like 050-056). Record proposal 0063
implemented, document the with_tool_call scope and the Tool
observation, and add the CHANGELOG entry. Also reconcile a stale
LangfuseClient method count and add the Tool observation to the
Langfuse-mapping overview.
Copilot AI review requested due to automatic review settings June 22, 2026 22:59

Copilot AI left a comment

Pull request overview

Implements accepted proposal 0063 (spec v0.69.0) by adding a node-body with_tool_call instrumentation scope that dispatches typed tool-execution events, and rendering those events in the OTel and Langfuse observers.

Changes:

  • Added with_tool_call + ToolCallScope, plus ToolCallEvent / ToolCallFailedEvent to the observer event union.
  • Rendered tool execution as openarmature.tool.call (OTel span) and as a dedicated Langfuse tool observation; updated Langfuse adapter/client protocol accordingly.
  • Bumped spec pin to v0.69.0 and added unit + conformance coverage for tool-execution observability.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/unit/test_tool_call.py | Unit tests for with_tool_call dispatch/re-raise behavior and edge cases. |
| tests/unit/test_observability_otel.py | Unit tests for OTel openarmature.tool.call span attributes, gating, failure mapping, serialization fallback. |
| tests/unit/test_observability_langfuse.py | Unit tests for Langfuse dedicated tool observation rendering and gating/serialization behavior. |
| tests/unit/test_observability_langfuse_adapter.py | Unit tests ensuring back-dated tool observations route via private OTel tracer path. |
| tests/test_smoke.py | Updated spec version assertion to 0.69.0. |
| tests/conformance/test_observability.py | Added runner + assertions for tool observability fixtures (092–098). |
| tests/conformance/test_fixture_parsing.py | Deferred parsing notes for tool typed-collector fixtures (092–095). |
| tests/conformance/harness/directives.py | Added calls_tool directive schema and mock tool spec. |
| tests/conformance/adapter.py | Added record state type mapping for tool fixtures. |
| src/openarmature/observability/tool_call.py | New implementation of with_tool_call scope and ToolCallScope. |
| src/openarmature/observability/otel/observer.py | OTel rendering for tool events; safer JSON serialization with default=str. |
| src/openarmature/observability/langfuse/observer.py | Langfuse rendering for tool events; safer JSON serialization with default=str. |
| src/openarmature/observability/langfuse/client.py | Extended protocol and in-memory client to support tool observations. |
| src/openarmature/observability/langfuse/adapter.py | Added adapter support for tool() and generalized back-dating helper. |
| src/openarmature/observability/correlation.py | Extended dispatch typing to include tool event variants. |
| src/openarmature/observability/__init__.py | Exported with_tool_call and ToolCallScope. |
| src/openarmature/graph/observer.py | Included tool events in observer union typing. |
| src/openarmature/graph/events.py | Added ToolCallEvent and ToolCallFailedEvent dataclasses and exports. |
| src/openarmature/AGENTS.md | Updated bundled agent doc spec version to v0.69.0. |
| src/openarmature/__init__.py | Updated __spec_version__ to 0.69.0. |
| pyproject.toml | Updated [tool.openarmature].spec_version to 0.69.0. |
| docs/concepts/observability.md | Documented with_tool_call and backend renderings; updated Langfuse section wording. |
| conformance.toml | Updated spec pin to v0.69.0 and marked proposal 0063 implemented. |
| CHANGELOG.md | Added release notes entry for tool-execution observability; updated spec pin summary. |

- with_tool_call: drop the result sentinel. The scope result defaults
  to None and the event carries it directly; a forgotten set_result and
  a tool that returns None both emit a null result, which is correct, so
  the sentinel produced no distinguishable output.
- observability docs: name tool_call_id explicitly in the Langfuse Tool
  metadata sentence (the feature has both tool_call_id and call_id).
- conformance: _assert_langfuse_observation_tree consumes each matched
  observation, so two same-shape expected siblings can't both bind to
  one actual observation.
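
The consuming-match fix in the last item can be sketched as follows. `match_siblings` is a hypothetical simplification, not the body of `_assert_langfuse_observation_tree`: it matches flat dicts by subset equality and binds greedily in order, which may differ from how the real helper compares observations.

```python
def match_siblings(expected: list, actual: list) -> list:
    """Bind each expected sibling to a distinct actual observation.

    Each match is removed from the candidate pool, so two same-shape
    expectations cannot both bind to one actual observation.
    """
    remaining = list(actual)
    bound = []
    for want in expected:
        for i, got in enumerate(remaining):
            # Subset match: every expected key must be present with an equal value.
            if all(got.get(key) == value for key, value in want.items()):
                bound.append(remaining.pop(i))  # consume the matched observation
                break
        else:
            raise AssertionError(f"no unmatched observation for {want!r}")
    return bound


expected = [
    {"type": "tool", "name": "get_weather"},
    {"type": "tool", "name": "get_weather"},
]
two_actual = [
    {"type": "tool", "name": "get_weather", "id": "obs-1"},
    {"type": "tool", "name": "get_weather", "id": "obs-2"},
]
bound = match_siblings(expected, two_actual)
```

With only one actual observation, the second expectation finds nothing left to bind to and the sketch raises, which is the behavior the review fix is after.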
@chris-colinsky chris-colinsky merged commit 0479bf0 into main Jun 22, 2026
6 checks passed
@chris-colinsky chris-colinsky deleted the feature/0063-tool-execution-observability branch June 22, 2026 23:45